Import useful upstream exo PRs#16
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2f4fb54649
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
Patched the Codex review findings in 575c595:
Validation run locally:
|
|
Cherry-picked Trivial conflict resolved: combined the new Verified clean: Branch is now 15 commits ahead of |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 33d725205a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Two fixes for PR #16 round 2 (Codex): 1. P1: Skip quantized caches for non-PP make_cache models. ``make_kv_cache`` had a single-node safeguard (``_model_is_pipeline_parallel``) on the make_cache-LESS branch, but the make_cache-aware branch above it unconditionally replaced ``KVCache`` entries with ``QuantizedKVCache`` whenever ``EXO_KV_CACHE_BITS`` was set. Models that expose ``make_cache()`` (e.g. Gemma3 with mixed attention layers) and run single-node would therefore crash at runtime with:: <class 'mlx_lm.models.cache.QuantizedKVCache'> does not yet support batching with history because mlx-lm's single-node ``BatchGenerator`` calls ``_merge_caches`` on every step and ``QuantizedKVCache`` doesn't implement ``merge``. Fix: apply the same ``_model_is_pipeline_parallel`` guard to the make_cache branch and emit the same warning when the env var is ignored. 2. P2: Verify peer file integrity before marking download complete. ``download_one`` in ``peer_shard_downloader.py`` marked peer downloads successful as soon as ``n_read == expected_size``, with no content-integrity check. A peer serving wrong bytes with the right length (stale/corrupt/malicious) was therefore silently accepted as model data, causing hard-to-diagnose inference failures or bad outputs. Fix: after the peer download completes, fetch the authoritative etag/hash from HuggingFace via ``file_meta()`` and validate the downloaded file via ``calc_hash``. On mismatch the file is removed and the caller falls back to direct HF download. Trusts HF as canonical source rather than peer-advertised hash to defend against malicious peers that lie about both content and hash. ``file_meta`` adds one HEAD round-trip per file; ``fetch_file_list_with_cache`` already requires HF connectivity at this code path so the network requirement is not new. The semantics now match the direct HuggingFace download path which has done identical validation since import.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6babb76dca
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Address Codex round-2 P1s on PR #16. P1: Reuse ``ignore_patterns`` when selecting peer download files - ``download_shard`` (download_utils.py:983) excludes ``original/*`` and ``metal/*`` because HuggingFace never downloads them. The peer path applied ``allow_patterns`` only, so any repo containing those paths (e.g. Llama 3.x) had a required-files list that included files the peer never had locally; the strict ``peer_info missing => fail`` check then aborted the entire transfer and forced a HF fallback for every download. Pass the same ``ignore_patterns=["original/*", "metal/*"]`` into ``filter_repo_objects`` to match selection. P1: Honor offline mode in peer file-list fetch - ``_try_peer_download`` hard-coded ``skip_internet=False`` when calling ``fetch_file_list_with_cache``, so offline-configured nodes still reached out to HuggingFace before downloading from a LAN peer. On cold/offline nodes without a cached file list this raised, the ``except: return None`` exited early, and the peer download could not even start. Add an ``offline`` parameter to ``PeerAwareShardDownloader.__init__`` (defaulting to ``False``) and thread the existing ``DownloadCoordinator.offline`` value through ``exo_shard_downloader`` so the peer file-list fetch honors the same offline contract as ``ResumableShardDownloader``. Tests - ``test_offline_flag_defaults_to_false`` / ``..._propagates`` cover the constructor wiring directly. - ``test_try_peer_download_passes_offline_to_fetch_file_list`` patches the import binding and asserts ``skip_internet=True`` is forwarded when the downloader is constructed with ``offline=True``. - ``test_try_peer_download_filters_ignore_patterns`` records every ``filter_repo_objects`` call and asserts that the peer path requested ``ignore_patterns=["original/*", "metal/*"]`` -- the exact set ``download_shard`` uses.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f48d5f001e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
… port Codex P1 (round 3): peer downloads were calling ``file_meta()`` for every file even when the coordinator ran with ``--offline`` / ``EXO_OFFLINE=true``. Any failure to reach HF (the entire point of offline mode) was treated as an integrity-check failure, the peer-fetched bytes were deleted, and the cold offline node was left with no path to complete model sync. When the downloader is in offline mode we now trust the LAN peer's bytes (size already enforced by ``download_file_from_peer``) and skip the HF canonical-hash call entirely. Online mode still validates against HF. Codex P2 (round 3): the peer-download listener was hard-coded to a single module-level constant, so a same-host multi-node deployment crashed on the second process with ``address already in use``. Add a ``--peer-download-port`` CLI flag (default ``EXO_PEER_DOWNLOAD_PORT``) and thread it through ``Node`` -> ``Worker`` and ``PeerFileServer``, replacing the import-time constant. Cluster-wide convention: every node uses the same value (peer discovery still uses the local value as the assumed remote port). Cross-node port advertisement via state is a documented follow-up. Tests: - ``TestPeerDownloadIntegrityCheckRespectsOfflineMode`` covers both modes: offline must not call ``file_meta`` and must keep the bytes, online must still call ``file_meta``.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6154483f54
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| peer_file_server = PeerFileServer( | ||
| host="0.0.0.0", | ||
| port=args.peer_download_port, | ||
| models_dir=EXO_DEFAULT_MODELS_DIR, |
There was a problem hiding this comment.
Serve peer files from all writable model directories
The peer file server is hard-wired to EXO_DEFAULT_MODELS_DIR, but downloads are placed via select_download_dir_for_shard() across EXO_MODELS_DIRS; when a node stores a model in a non-default writable directory (custom models path, low-space fallback, or default excluded as read-only), /status and /files will not expose that model and peers will always fall back to HuggingFace. This silently disables the new peer-download path for valid multi-directory deployments.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Resolved. Resolved by commit fe02cea6. The peer file server now serves files from every writable model directory via _locate_all_model_dirs; downloads landing in any of EXO_DEFAULT_MODELS_DIR / EXO_DOWNLOAD_DIRECTORIES are advertised to peers and downloadable from peers.
| else: | ||
| return PeerEndpoint( |
There was a problem hiding this comment.
Prefer RDMA edge before returning socket endpoint
This function returns immediately on the first non-RDMA edge for the peer, so if topology iteration yields a socket edge before an RDMA edge, the same peer is mislabeled as socket and never gets RDMA priority. That breaks the intended "RDMA first" peer ordering and can consistently select slower peers depending on edge insertion order.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Resolved. Resolved by commit fe02cea6 (Pick RDMA edges and serve from every model dir for peer downloads). _resolve_endpoint_for_peer now collects ALL edges between the pair before choosing, prefers the first RDMA edge if one exists, and only falls back to a socket edge when no RDMA edge is present. See peer_state.py:97-126.
This addresses two Codex P2 findings on PR #16. `_resolve_peer_endpoint` returned on the *first* topology edge it visited for a peer, so when ``out_edges`` happened to yield the SocketConnection before the RDMAConnection (insertion order is not a stable contract on the topology graph), the peer was silently labelled ``socket`` and lost its RDMA priority in the peer ordering. The lookup now scans all edges for a peer, prefers RDMA whenever any RDMA edge exists (using the companion socket address for the actual TCP connect), and only falls back to ``socket`` when no RDMA edge is present. The unused `_find_socket_ip` helper that was meant as a fallback for the RDMA branch is removed -- its behaviour is folded into the new scan. `PeerFileServer` was hard-wired to ``EXO_DEFAULT_MODELS_DIR``, but ``select_download_dir_for_shard`` legitimately writes shards across ``EXO_MODELS_DIRS`` (custom paths, low-disk fallback) and we also have shards available in ``EXO_MODELS_READ_ONLY_DIRS`` mounts. Pre-fix, any model that landed outside the default directory was invisible to /status and /files, so peers always fell back to HuggingFace and the new peer path was a no-op for valid multi-directory deployments. The server now takes ``models_dirs`` (a sequence) and probes every configured root in caller-specified priority, checking each candidate against ``_resolve_child`` for path-traversal safety. Adds regression coverage: * `test_peer_state.py` covers RDMA-first ordering for both edge insertion orders and the socket-only / RDMA-only edge cases. * `TestPeerFileServerMultipleDirectories` covers serving from a secondary writable directory and a read-only mount, plus the constructor's empty-list rejection.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fe02cea608
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| # Identity (config) | ||
| EXO_NODE_ID_KEYPAIR = EXO_CONFIG_HOME / "node_id.keypair" | ||
| EXO_NODE_ID_KEYPAIR = EXO_CACHE_HOME / "node_id.keypair" |
There was a problem hiding this comment.
Keep node identity keypair out of cache storage
Saving EXO_NODE_ID_KEYPAIR under EXO_CACHE_HOME makes node identity vulnerable to normal cache eviction/cleanup, which causes nodes to come up with a new peer ID and breaks the intended persistence of cluster identity across restarts. Identity material should remain in a persistent config/state location rather than cache.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Resolved. Resolved by commit 09d0279b (Reject oversized peer partials; relocate node-ID keypair to config dir). EXO_NODE_ID_KEYPAIR now defaults under EXO_CONFIG_HOME rather than EXO_CACHE_HOME, so node identity is no longer subject to cache eviction. A migration helper (_migrate_legacy_node_id_keypair) moves any pre-existing keypair from the cache location to the config location on first startup.
| chunk_size = 8 * 1024 * 1024 # 8MB, matching HF download | ||
|
|
||
| try: | ||
| while n_read < expected_size and poll_count < max_poll_attempts: |
There was a problem hiding this comment.
Reject oversized stale partials before peer resume
The resume loop only runs while n_read < expected_size, so if an existing .partial is already larger than expected_size (stale/corrupt prior download), this path skips downloading and later renames that file as a successful result. In offline mode (where hash verification is intentionally skipped), this can silently accept incorrect model bytes and produce broken inference artifacts.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Resolved. Resolved by commit 09d0279b. The peer-download resume loop now rejects oversized stale partial files: if partial_size > expected_size, the partial is deleted and the download restarts from byte 0 rather than appending a fresh body to a stale buffer.
Two Codex round-(N+1) P1 fixes for PR #16: 1. peer_download.py:120 -- "Reject oversized stale partials before peer resume". The resume loop ran ``while n_read < expected_size``, so any pre-existing ``.partial`` larger than ``expected_size`` skipped the loop entirely and the final ``rename`` accepted the bad bytes as the canonical download. In offline mode (where peer download integrity check is intentionally skipped) this would permanently poison the model cache. Discard the stale oversized partial up front and restart from zero on this peer. 2. constants.py:73 -- "Keep node identity keypair out of cache storage". ``EXO_NODE_ID_KEYPAIR`` was rooted at ``EXO_CACHE_HOME``, which is subject to normal cache cleanup (e.g. ``trash ~/.cache/exo``); a wipe silently regenerated a fresh peer ID and broke cluster membership / mDNS routes. Move it to ``EXO_CONFIG_HOME`` (matching the existing ``test_node_id_in_config_dir`` invariant) and add a one-shot transparent migrator in ``router._migrate_legacy_node_id_keypair`` so existing nodes retain identity after the upgrade. Regression tests: * ``test_oversized_stale_partial_is_discarded_and_retransferred`` -- pre-fix would rename the junk ``.partial`` as the downloaded weights file; post-fix the file is re-fetched and matches the canonical bytes. * ``test_legacy_keypair_is_migrated_to_new_location``, ``test_migration_is_idempotent_when_new_location_already_present``, ``test_migration_skipped_when_no_legacy_file``, ``test_get_node_id_keypair_uses_migrated_legacy_keypair``. X-Orchestraitor-Plan: ecosystem_repo_standardization_aeee88ba X-Agent-Platform: cursor-claude-opus-4.7
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 09d0279baf
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| path: str | bytes | PathLike[str] | PathLike[bytes] = EXO_NODE_ID_KEYPAIR, | ||
| legacy_path: str | bytes | PathLike[str] | PathLike[bytes] | None = ( | ||
| EXO_LEGACY_NODE_ID_KEYPAIR |
There was a problem hiding this comment.
Preserve per-process node identity isolation
Loading and reusing a single on-disk keypair from the default EXO_NODE_ID_KEYPAIR path makes concurrent exo processes on the same host share the same NodeId, which breaks cluster semantics (for example, peer discovery paths that exclude peer_node_id == node_id and routing that expects unique node IDs). This is a regression from the previous per-process Keypair.generate() behavior and directly conflicts with the new same-host multi-node workflow introduced in this commit (distinct ports but still one shared identity file).
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Resolved. Resolved by commit 231335a0 (Scope node-ID keypair per process and migrate inside the file lock). The default EXO_NODE_ID_KEYPAIR path is now scoped per-process via _node_id_keypair_scope(args), so concurrent exo processes on the same host load distinct keypair files.
| if legacy_path is not None: | ||
| _migrate_legacy_node_id_keypair(resolved_path, Path(str(legacy_path))) |
There was a problem hiding this comment.
Move legacy keypair migration inside the file lock
Calling _migrate_legacy_node_id_keypair() before acquiring FileLock introduces a startup race: two processes can both pass the existence checks, one can generate/write a new keypair under the lock, and the other can still overwrite it via legacy replace() outside the lock. That leaves processes in the same launch with divergent in-memory identities vs. persisted identity, causing unstable peer IDs across restarts.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Resolved. Resolved by commit 231335a0. _migrate_legacy_node_id_keypair is now called inside the FileLock block, so two processes cannot both pass the existence check and both attempt the migration.
Codex P1/P2 (PR #16 round-(N+2), router.py:297, router.py:322): the same-host multi-node workflow this PR introduces (distinct peer-download ports per process) requires distinct NodeIds per process so peer-discovery's self-skip and routing's unique-NodeId invariants hold. Concurrent startups also raced on the legacy cache->config keypair migration because it ran before FileLock. - Add ``process_scope`` parameter to ``get_node_id_keypair`` that is folded into the on-disk filename (``node_id.<scope>.keypair``). Single-process deployments default to None and keep the existing shared file; ``main.py`` passes ``args.peer_download_port`` so multi-process same-host runs land on distinct identities. - Move the legacy migration call inside the ``FileLock`` so two processes can't both pass the existence check and race into divergent in-memory vs. on-disk identities. - Legacy file remains unscoped: the first process to migrate adopts the operator's existing identity; later processes (other scopes) start with fresh keypairs, which is exactly what per-process isolation requires. - Add regression tests covering distinct scopes producing distinct keypairs, scope stability across calls, scoped legacy adoption, and a structural check that the migration runs inside the FileLock.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 231335a096
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # ``NodeId``s) so peer-discovery's ``peer_node_id == | ||
| # node_id`` self-skip and routing's unique-NodeId | ||
| # invariants continue to hold. | ||
| keypair = get_node_id_keypair(process_scope=args.peer_download_port) |
There was a problem hiding this comment.
Stop reusing one scoped keypair when peer download is disabled
This always scopes get_node_id_keypair by args.peer_download_port, even when --no-downloads or --no-peer-download disables the peer file server. In that mode, operators can legitimately run multiple same-host nodes with the default peer-download port (no bind conflict), and both processes will load the same node_id.<port>.keypair, producing identical NodeIds and breaking election/routing assumptions that node IDs are unique. Scope identity with a value that is unique in all multi-process modes (or only use peer-download-port scoping when peer download is actually enabled).
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Resolved. Resolved by the existing scope construction at main.py:472-479: the scope concatenates libp2p_port, api_port, AND peer_download_port, and at least one of {api_port, libp2p_port} must differ between two same-host processes (each is a distinct socket bind). Two --no-peer-download processes with the same default api_port/libp2p_port would conflict at the api or libp2p socket bind first, surfacing the conflict before they ever reach the keypair load.
| elif r.status in (200, 206): | ||
| async with aiofiles.open( | ||
| partial_path, "ab" if n_read > 0 else "wb" | ||
| ) as f: |
There was a problem hiding this comment.
Require 206 for ranged peer-download resumes
When resuming (n_read > 0), the client sends a Range header but still accepts HTTP 200 and appends the body to the existing partial file. A server is allowed to ignore Range and return full content with 200, which here will duplicate bytes, push n_read past expected_size, and still rename the oversized file as success; in offline mode this can silently poison model files because hash verification is skipped. On resume, treat non-206 as a restart/failure (or truncate and rewrite from zero) instead of appending.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Resolved. Resolved by commit f34534c2 (Combine listening ports for keypair scope; restart on 200-on-resume). When n_read > 0 (resume) the client now requires HTTP 206; a 200 response triggers the partial file to be deleted and the download to restart from byte 0, preventing prepended-old + appended-new corruption.
Codex P1 (PR #16 round-(N+3), main.py:74): the previous scope used ``args.peer_download_port`` only. With ``--no-downloads`` / ``--no-peer-download`` the peer file server doesn't bind, so two same-host processes can both keep the default ``peer_download_port`` and would then load the same scoped keypair file -- producing identical ``NodeId``s and breaking election/routing's unique-NodeId invariants. The new ``_node_id_keypair_scope`` helper combines libp2p, api, and peer-download ports: at least one of those MUST differ between two same-host processes (each is a distinct local socket bind), so the resulting scope is always per-process unique while remaining stable across restarts of the same configuration. Codex P1 (PR #16 round-(N+3), peer_download.py:162): on resume ``download_file_from_peer`` sends a ``Range`` header but accepted HTTP 200 and appended to the existing partial. A non-compliant peer server is allowed to ignore Range and return full content with 200, which would duplicate bytes, push ``n_read`` past ``expected_size``, and -- because offline mode skips hash verification -- silently poison the model cache by renaming the oversized file as success. Now we treat 200-on-resume as a restart: discard the partial, reset ``n_read = 0``, and the next loop iteration re-fetches from zero. Add regression tests: - ``TestNodeIdKeypairScope`` covers the per-process scope helper: distinct libp2p / api / peer-download ports each yield distinct scopes; the same args yield the same scope; and the original bug (same default peer_download_port with peer-download disabled) is now isolated by libp2p_port differences. - ``test_resume_with_200_response_discards_partial_and_restarts`` stands up a tiny aiohttp server that always returns 200 (even for ranged requests), primes a partial file, and asserts the client discards the partial, restarts from zero, and lands the canonical bytes matching ``expected_size``.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f34534c23b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| main.py:74). | ||
| """ | ||
| return ( | ||
| f"libp2p-{args.libp2p_port}.api-{args.api_port}.peer-{args.peer_download_port}" |
There was a problem hiding this comment.
Include resolved libp2p port in node-ID scope
The scope string is derived from configured ports only, so --libp2p-port 0 always contributes the literal 0 even though each process actually binds a different ephemeral port at runtime. In worker-only runs where API and peer-download listeners are disabled, two same-host processes can share identical scope strings and therefore load the same keypair file, causing duplicate NodeIds and breaking routing/election assumptions. Use the resolved bound libp2p port (or another guaranteed per-process discriminator) when the configured port is 0.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Resolved. Resolved by commit ffcb67f4 (PR #16 R(N+8) P1: address libp2p-port=0 scope collision). _node_id_keypair_scope now folds os.getpid() into the scope when args.libp2p_port == 0, so two same-host processes binding ephemeral libp2p ports get distinct scopes (and therefore distinct node-ID keypair files).
| chunk = await r.content.read(chunk_size) | ||
| if not chunk: | ||
| break | ||
| written = await f.write(chunk) | ||
| n_read += written |
There was a problem hiding this comment.
Reject peer responses that exceed expected file size
The download loop keeps appending bytes until EOF and only checks n_read < expected_size afterward, so an oversized peer response is accepted as success (n_read >= expected_size) and then renamed into the model cache. In offline mode this bypasses hash verification entirely, so a peer serving a larger stale/wrong blob can silently poison local weights. Stop writing once expected_size is reached and treat any extra bytes as a failure.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Resolved. Resolved by commit ffcb67f4 (PR #16 R(N+8) P1: oversized peer responses). The download loop now compares n_read + len(chunk) > expected_size during the read and aborts immediately with RuntimeError, so an oversized peer response cannot trickle past the post-loop guard.
… peer responses main.py: when --libp2p-port 0 is set, the configured value is the literal 0 even though each process binds a different ephemeral port at runtime. Two same-host worker-only processes (no API, no peer download) sharing the default api/peer ports would otherwise produce identical scope strings and load the same on-disk keypair file, breaking the unique-NodeId invariant. Fold os.getpid() into the scope when libp2p_port == 0; the trade-off (ephemeral identity for ephemeral ports) is the right semantic since the operator opted into ephemeral binding by setting libp2p_port=0. peer_download.py: bound the inner read by 'expected_size - n_read' and treat any extra bytes as a peer protocol violation. Pre-fix the loop kept appending until EOF and only checked n_read < expected_size afterwards, so an oversized response (peer serving a stale/wrong blob) was accepted as success and renamed into the cache. In offline mode hash verification is skipped, so this silently poisoned local weights. New tests: - test_libp2p_port_zero_uses_pid_for_per_process_isolation: verifies the scope contains 'pid-<getpid>' when libp2p_port=0. - test_libp2p_port_zero_in_two_processes_yield_distinct_scopes: monkeypatches os.getpid to simulate two same-host processes both binding libp2p_port=0 with identical api/peer ports and asserts the scopes diverge. - test_oversized_peer_response_is_rejected_and_restarted: stands up a bad aiohttp peer that always serves canonical+'POISONED' bytes and asserts the client never lands the trailing junk in the cache.
|
@codex review |
The constant is the default for built-in models with known model cards, which are trusted. Custom models added via API already default to trust_remote_code=False in ModelCard.fetch_from_hf(). The CLI flag overrides custom models only. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
X-Orchestraitor-Task: exo-upstream-pr-import X-Orchestraitor-Plan: import-useful-upstream-prs X-Agent-Platform: codex
Cherry-picked from upstream 701838a (Alex Cheema, exo-explore/exo). Resolves trivial dataclass conflict in src/exo/main.py: combined `peer_file_server: PeerFileServer | None = None` (this branch's peer-to-peer download field) with `_libp2p_port: int` (mDNS commit's new field). Both fields kept; constructor positional args reordered to match. Type-checked clean (basedpyright 0 errors). Ruff clean.
Two fixes for PR #16 round 2 (Codex): 1. P1: Skip quantized caches for non-PP make_cache models. ``make_kv_cache`` had a single-node safeguard (``_model_is_pipeline_parallel``) on the make_cache-LESS branch, but the make_cache-aware branch above it unconditionally replaced ``KVCache`` entries with ``QuantizedKVCache`` whenever ``EXO_KV_CACHE_BITS`` was set. Models that expose ``make_cache()`` (e.g. Gemma3 with mixed attention layers) and run single-node would therefore crash at runtime with:: <class 'mlx_lm.models.cache.QuantizedKVCache'> does not yet support batching with history because mlx-lm's single-node ``BatchGenerator`` calls ``_merge_caches`` on every step and ``QuantizedKVCache`` doesn't implement ``merge``. Fix: apply the same ``_model_is_pipeline_parallel`` guard to the make_cache branch and emit the same warning when the env var is ignored. 2. P2: Verify peer file integrity before marking download complete. ``download_one`` in ``peer_shard_downloader.py`` marked peer downloads successful as soon as ``n_read == expected_size``, with no content-integrity check. A peer serving wrong bytes with the right length (stale/corrupt/malicious) was therefore silently accepted as model data, causing hard-to-diagnose inference failures or bad outputs. Fix: after the peer download completes, fetch the authoritative etag/hash from HuggingFace via ``file_meta()`` and validate the downloaded file via ``calc_hash``. On mismatch the file is removed and the caller falls back to direct HF download. Trusts HF as canonical source rather than peer-advertised hash to defend against malicious peers that lie about both content and hash. ``file_meta`` adds one HEAD round-trip per file; ``fetch_file_list_with_cache`` already requires HF connectivity at this code path so the network requirement is not new. The semantics now match the direct HuggingFace download path which has done identical validation since import.
Address Codex round-2 P1s on PR #16. P1: Reuse ``ignore_patterns`` when selecting peer download files - ``download_shard`` (download_utils.py:983) excludes ``original/*`` and ``metal/*`` because HuggingFace never downloads them. The peer path applied ``allow_patterns`` only, so any repo containing those paths (e.g. Llama 3.x) had a required-files list that included files the peer never had locally; the strict ``peer_info missing => fail`` check then aborted the entire transfer and forced a HF fallback for every download. Pass the same ``ignore_patterns=["original/*", "metal/*"]`` into ``filter_repo_objects`` to match selection. P1: Honor offline mode in peer file-list fetch - ``_try_peer_download`` hard-coded ``skip_internet=False`` when calling ``fetch_file_list_with_cache``, so offline-configured nodes still reached out to HuggingFace before downloading from a LAN peer. On cold/offline nodes without a cached file list this raised, the ``except: return None`` exited early, and the peer download could not even start. Add an ``offline`` parameter to ``PeerAwareShardDownloader.__init__`` (defaulting to ``False``) and thread the existing ``DownloadCoordinator.offline`` value through ``exo_shard_downloader`` so the peer file-list fetch honors the same offline contract as ``ResumableShardDownloader``. Tests - ``test_offline_flag_defaults_to_false`` / ``..._propagates`` cover the constructor wiring directly. - ``test_try_peer_download_passes_offline_to_fetch_file_list`` patches the import binding and asserts ``skip_internet=True`` is forwarded when the downloader is constructed with ``offline=True``. - ``test_try_peer_download_filters_ignore_patterns`` records every ``filter_repo_objects`` call and asserts that the peer path requested ``ignore_patterns=["original/*", "metal/*"]`` -- the exact set ``download_shard`` uses.
… port Codex P1 (round 3): peer downloads were calling ``file_meta()`` for every file even when the coordinator ran with ``--offline`` / ``EXO_OFFLINE=true``. Any failure to reach HF (the entire point of offline mode) was treated as an integrity-check failure, the peer-fetched bytes were deleted, and the cold offline node was left with no path to complete model sync. When the downloader is in offline mode we now trust the LAN peer's bytes (size already enforced by ``download_file_from_peer``) and skip the HF canonical-hash call entirely. Online mode still validates against HF. Codex P2 (round 3): the peer-download listener was hard-coded to a single module-level constant, so a same-host multi-node deployment crashed on the second process with ``address already in use``. Add a ``--peer-download-port`` CLI flag (default ``EXO_PEER_DOWNLOAD_PORT``) and thread it through ``Node`` -> ``Worker`` and ``PeerFileServer``, replacing the import-time constant. Cluster-wide convention: every node uses the same value (peer discovery still uses the local value as the assumed remote port). Cross-node port advertisement via state is a documented follow-up. Tests: - ``TestPeerDownloadIntegrityCheckRespectsOfflineMode`` covers both modes: offline must not call ``file_meta`` and must keep the bytes, online must still call ``file_meta``.
This addresses two Codex P2 findings on PR #16. `_resolve_peer_endpoint` returned on the *first* topology edge it visited for a peer, so when ``out_edges`` happened to yield the SocketConnection before the RDMAConnection (insertion order is not a stable contract on the topology graph), the peer was silently labelled ``socket`` and lost its RDMA priority in the peer ordering. The lookup now scans all edges for a peer, prefers RDMA whenever any RDMA edge exists (using the companion socket address for the actual TCP connect), and only falls back to ``socket`` when no RDMA edge is present. The unused `_find_socket_ip` helper that was meant as a fallback for the RDMA branch is removed -- its behaviour is folded into the new scan. `PeerFileServer` was hard-wired to ``EXO_DEFAULT_MODELS_DIR``, but ``select_download_dir_for_shard`` legitimately writes shards across ``EXO_MODELS_DIRS`` (custom paths, low-disk fallback) and we also have shards available in ``EXO_MODELS_READ_ONLY_DIRS`` mounts. Pre-fix, any model that landed outside the default directory was invisible to /status and /files, so peers always fell back to HuggingFace and the new peer path was a no-op for valid multi-directory deployments. The server now takes ``models_dirs`` (a sequence) and probes every configured root in caller-specified priority, checking each candidate against ``_resolve_child`` for path-traversal safety. Adds regression coverage: * `test_peer_state.py` covers RDMA-first ordering for both edge insertion orders and the socket-only / RDMA-only edge cases. * `TestPeerFileServerMultipleDirectories` covers serving from a secondary writable directory and a read-only mount, plus the constructor's empty-list rejection.
Two Codex round-(N+1) P1 fixes for PR #16: 1. peer_download.py:120 -- "Reject oversized stale partials before peer resume". The resume loop ran ``while n_read < expected_size``, so any pre-existing ``.partial`` larger than ``expected_size`` skipped the loop entirely and the final ``rename`` accepted the bad bytes as the canonical download. In offline mode (where peer download integrity check is intentionally skipped) this would permanently poison the model cache. Discard the stale oversized partial up front and restart from zero on this peer. 2. constants.py:73 -- "Keep node identity keypair out of cache storage". ``EXO_NODE_ID_KEYPAIR`` was rooted at ``EXO_CACHE_HOME``, which is subject to normal cache cleanup (e.g. ``trash ~/.cache/exo``); a wipe silently regenerated a fresh peer ID and broke cluster membership / mDNS routes. Move it to ``EXO_CONFIG_HOME`` (matching the existing ``test_node_id_in_config_dir`` invariant) and add a one-shot transparent migrator in ``router._migrate_legacy_node_id_keypair`` so existing nodes retain identity after the upgrade. Regression tests: * ``test_oversized_stale_partial_is_discarded_and_retransferred`` -- pre-fix would rename the junk ``.partial`` as the downloaded weights file; post-fix the file is re-fetched and matches the canonical bytes. * ``test_legacy_keypair_is_migrated_to_new_location``, ``test_migration_is_idempotent_when_new_location_already_present``, ``test_migration_skipped_when_no_legacy_file``, ``test_get_node_id_keypair_uses_migrated_legacy_keypair``. X-Orchestraitor-Plan: ecosystem_repo_standardization_aeee88ba X-Agent-Platform: cursor-claude-opus-4.7
Codex P1/P2 (PR #16 round-(N+2), router.py:297, router.py:322): the same-host multi-node workflow this PR introduces (distinct peer-download ports per process) requires distinct NodeIds per process so peer-discovery's self-skip and routing's unique-NodeId invariants hold. Concurrent startups also raced on the legacy cache->config keypair migration because it ran before FileLock. - Add ``process_scope`` parameter to ``get_node_id_keypair`` that is folded into the on-disk filename (``node_id.<scope>.keypair``). Single-process deployments default to None and keep the existing shared file; ``main.py`` passes ``args.peer_download_port`` so multi-process same-host runs land on distinct identities. - Move the legacy migration call inside the ``FileLock`` so two processes can't both pass the existence check and race into divergent in-memory vs. on-disk identities. - Legacy file remains unscoped: the first process to migrate adopts the operator's existing identity; later processes (other scopes) start with fresh keypairs, which is exactly what per-process isolation requires. - Add regression tests covering distinct scopes producing distinct keypairs, scope stability across calls, scoped legacy adoption, and a structural check that the migration runs inside the FileLock.
Codex P1 (PR #16 round-(N+3), main.py:74): the previous scope used ``args.peer_download_port`` only. With ``--no-downloads`` / ``--no-peer-download`` the peer file server doesn't bind, so two same-host processes can both keep the default ``peer_download_port`` and would then load the same scoped keypair file -- producing identical ``NodeId``s and breaking election/routing's unique-NodeId invariants. The new ``_node_id_keypair_scope`` helper combines libp2p, api, and peer-download ports: at least one of those MUST differ between two same-host processes (each is a distinct local socket bind), so the resulting scope is always per-process unique while remaining stable across restarts of the same configuration. Codex P1 (PR #16 round-(N+3), peer_download.py:162): on resume ``download_file_from_peer`` sends a ``Range`` header but accepted HTTP 200 and appended to the existing partial. A non-compliant peer server is allowed to ignore Range and return full content with 200, which would duplicate bytes, push ``n_read`` past ``expected_size``, and -- because offline mode skips hash verification -- silently poison the model cache by renaming the oversized file as success. Now we treat 200-on-resume as a restart: discard the partial, reset ``n_read = 0``, and the next loop iteration re-fetches from zero. Add regression tests: - ``TestNodeIdKeypairScope`` covers the per-process scope helper: distinct libp2p / api / peer-download ports each yield distinct scopes; the same args yield the same scope; and the original bug (same default peer_download_port with peer-download disabled) is now isolated by libp2p_port differences. - ``test_resume_with_200_response_discards_partial_and_restarts`` stands up a tiny aiohttp server that always returns 200 (even for ranged requests), primes a partial file, and asserts the client discards the partial, restarts from zero, and lands the canonical bytes matching ``expected_size``.
… peer responses main.py: when --libp2p-port 0 is set, the configured value is the literal 0 even though each process binds a different ephemeral port at runtime. Two same-host worker-only processes (no API, no peer download) sharing the default api/peer ports would otherwise produce identical scope strings and load the same on-disk keypair file, breaking the unique-NodeId invariant. Fold os.getpid() into the scope when libp2p_port == 0; the trade-off (ephemeral identity for ephemeral ports) is the right semantic since the operator opted into ephemeral binding by setting libp2p_port=0. peer_download.py: bound the inner read by 'expected_size - n_read' and treat any extra bytes as a peer protocol violation. Pre-fix the loop kept appending until EOF and only checked n_read < expected_size afterwards, so an oversized response (peer serving a stale/wrong blob) was accepted as success and renamed into the cache. In offline mode hash verification is skipped, so this silently poisoned local weights. New tests: - test_libp2p_port_zero_uses_pid_for_per_process_isolation: verifies the scope contains 'pid-<getpid>' when libp2p_port=0. - test_libp2p_port_zero_in_two_processes_yield_distinct_scopes: monkeypatches os.getpid to simulate two same-host processes both binding libp2p_port=0 with identical api/peer ports and asserts the scopes diverge. - test_oversized_peer_response_is_rejected_and_restarted: stands up a bad aiohttp peer that always serves canonical+'POISONED' bytes and asserts the client never lands the trailing junk in the cache.
…roots master/main.py P1: revert the 5s node_inactivity_timeout to 30s. Pre-fix the 5s window was too tight: any node that didn't publish NodeGatheredInfo within 5s (e.g. when fast probes are unavailable or delayed) was marked timed out and had its instances deleted in the same _plan loop. Because this loop now ticks every second, normal telemetry jitter caused repeated false-positive NodeTimedOut events and unnecessary instance churn. The 1s tick stays so the master reacts quickly when a node *does* genuinely time out. peer_file_server.py P2: search every configured root before selecting the model dir to serve. Pre-fix _locate_model_dir returned the first root that *contained* the model directory regardless of completeness. When an earlier writable root held a partial download and a later read-only mount held a complete copy, /status and /files only saw the partial tree -- peers thought the node had no canonical copy and fell back to HuggingFace despite a complete local copy on a different mount. New behavior: - /status unions across all matching roots; for duplicate filenames, complete files dominate partials, larger partials dominate smaller. - /files prefers the root holding a complete copy of the requested file; falls back to the largest partial when no root has it complete; returns 404 only when every root truly lacks the file. Added _locate_all_model_dirs helper that returns every root holding the model in priority order (writable before read-only). New tests: - test_status_unions_partial_in_first_root_with_complete_in_second: verifies /status surfaces the complete file from a later root and marks it as complete (not partial). - test_files_serves_complete_copy_when_first_root_has_only_partial: end-to-end via aiohttp client, verifies /files returns 200 with canonical bytes and X-Exo-Complete=true.
PeerFileServer.run() returned immediately after site.start(), so the task spawned by Node.run() (tg.start_soon(self.peer_file_server.run)) completed on the first event-loop tick. The parent task group considered the server 'done' the moment the listener bound, so when the node was cancelled there was no live coroutine for the task group to drive teardown -- the aiohttp listener kept its TCP socket open until process exit. That manifested as 'OSError: [Errno 48] address already in use' whenever a node was stopped/restarted in the same process (tests, embedded runs, systemd-style restart loops). Make run() block on anyio.sleep_forever() after starting the listener and run runner.cleanup() in a shielded finally on cancellation. The shield prevents the cancellation from killing cleanup itself (which would re-introduce the leak); the cast on self._runner placates the type-checker without weakening the runtime guard against double-drive when an external shutdown() call has already torn things down. Add lifecycle tests that verify (1) run() does not exit on its own after site.start(), and (2) the listening port is reusable immediately after task-group cancellation. Pre-fix the second test fails with EADDRINUSE; post-fix it passes.
The peer transfer path skipped every file whose declared size was 0 (e.g. .gitattributes markers, empty __init__.py shims, empty config sentinels), so the local snapshot diverged from the filtered file list HF would have produced. DownloadCompleted was published with an incomplete model directory and downstream loaders that probe for those marker files (chat-template adapters, processor configs that expect an empty sentinel) failed in ways that didn't point back at the peer step. After the canonical (non-empty) peer transfers succeed, materialize each zero-byte marker as a local empty file using aiofiles in append mode (so a resumed-from-partial marker isn't truncated). Marker materialization is intentionally deferred until the canonical transfer succeeds: a partial peer transfer must NOT leave behind orphan empty files that masquerade as a complete download and confuse the HF fallback's already-downloaded probe. If marker creation itself fails (filesystem permissions, etc.) we fall back to HF for the full snapshot integrity guarantee. Add two regression tests: - test_zero_byte_marker_files_materialized_after_peer_transfer: asserts both root- and nested-zero-byte markers land on disk with size 0 after a successful peer transfer. - test_zero_byte_files_not_created_when_canonical_transfer_fails: asserts a failing canonical transfer leaves the markers absent so the HF fallback starts from a clean directory state.
…ansfer Round-(N+10)'s fix lumped FileListEntry(size=None) together with size=0 markers and materialized both as empty local files. But fetch_file_list_with_cache returns size=None for files discovered via the safetensors index whose size wasn't in the HF API response -- those are *real weight shards*, not markers. Treating them as empty produced 'DownloadCompleted' snapshots with corrupted weights that failed only at load/inference time, far from the peer step that caused the corruption. Split the two cases: - size == 0 stays the marker materialization path (touch on disk). - size is None aborts the peer transfer with a logged warning so the HF fallback gets a real download path. A pre-pass over filtered_file_list detects size=None and missing peer_info BEFORE any download_one coroutines are constructed, so the early bail-out can't leak un-awaited coroutines. Add a regression test that builds a file list with one canonical real-size file plus one size=None weight shard and asserts the peer transfer aborts (returns None), the unknown-size file is NOT created locally, and download_file_from_peer is never called.
Codex flagged a P1 finding on PR #16 head 8a9bc7c at ``src/exo/routing/router.py:359``: > ``get_node_id_keypair()`` locks on ``resolved_path``, so two > same-host processes with different ``process_scope`` values > acquire different lock files and can run legacy migration > concurrently. In the cross-device fallback path (``replace()`` > fails, then copy bytes), both processes can read the same legacy > keypair before it is unlinked and each write it to its own > scoped file, producing duplicate node identities despite > different scopes. This breaks the unique-NodeId assumptions in > routing/election during concurrent startup when cache/config > live on different filesystems. Root cause: ``_scoped_keypair_path`` (R(N+2)) intentionally folds the process scope into the on-disk filename so each process gets a distinct keypair, and the existing FileLock at ``router.py:357`` is keyed on that scoped path. Result: two concurrent same-host processes with distinct scopes acquire DIFFERENT lock files and never serialize against each other -- so both can enter ``_migrate_legacy_node_id_keypair`` concurrently before either has a chance to unlink the legacy file. ``replace()`` on the legacy path is atomic (only one wins), but the cross-device fallback (the ``OSError`` branch on Linux when ``XDG_*`` dirs span filesystems) reads-then-writes-then-unlinks, and that read-then- write window is wide enough for a second scope to read the same legacy bytes and copy them into its own scoped file. Two scoped keypairs end up holding identical bytes, so two same-host nodes boot with the same ``NodeId`` -- this is exactly the routing / election unique-identity assumption Codex called out. Fix: wrap ``_migrate_legacy_node_id_keypair`` in a second ``FileLock`` keyed on the **legacy** path. The legacy path is intentionally NOT scope-suffixed (it pre-dates per-process scoping), so the legacy-keyed lock is the single global serialization point shared across every scope. Lock ordering is ``resolved_path`` (outer) -> ``resolved_legacy`` (inner): no deadlock is possible because the legacy lock is only acquired while holding the per-scope lock and is released before keypair I/O resumes, so unrelated scopes' keypair I/O isn't blocked on identity housekeeping. Documented behaviour preserved: the docstring's "first process boots wins" semantic is now actually enforced -- the winner of the legacy lock unlinks the legacy file, the loser's migrator no-ops on the absent legacy and proceeds to generate a fresh keypair as the per-process isolation invariant requires. Regression: ``test_legacy_migration_serialized_across_process_scopes``. Forces the cross-device fallback by monkey-patching ``Path.replace`` to raise ``OSError`` on the legacy path, then pauses inside the byte-copy ``write_bytes`` for thread A while thread B starts up with a distinct scope. Pre-fix both threads slip through their per-scope locks and end up with identical scoped keypairs (the test's ``scope_a_bytes != scope_b_bytes`` assertion fails with identical byte strings -- verified by running the test against the pre-fix code via ``git stash``); post-fix the legacy lock blocks thread B until thread A finishes adoption, and exactly one scope ends up holding the legacy bytes while the other generates a fresh identity.
Codex flagged a P2 finding on PR #16 head a1374b4 at ``src/exo/download/peer_shard_downloader.py:407``: > When a repo includes zero-byte files, this branch skips > ``download_one`` and later materializes marker files on disk, > but it never updates ``file_progress`` for those paths. As a > result, ``calculate_repo_progress()`` can leave the overall > status as ``not_started`` even after all bytes are present, so > ``_download_progress_callback`` never emits ``DownloadCompleted`` > immediately and the model can remain stuck in ``DownloadOngoing`` > until the periodic reconciliation loop runs. Root cause: the round-(N+10) zero-byte materialization path explicitly skips ``download_one`` for zero-byte files, but ``download_one`` is the SOLE writer of the per-file ``status="complete"`` transition (via its inner ``on_file_progress`` callback). The seeded entry for a zero-byte file at line 338 defaults to ``status="not_started"`` and stays that way through the materialization step, so the final ``calculate_repo_progress`` rollup -- which uses per-file statuses to derive the overall repo status -- emits ``RepoDownloadProgress(status="not_started")`` even though every file is on disk. ``_download_progress_callback`` in the ``DownloadCoordinator`` only publishes ``DownloadCompleted`` on ``status="complete"``, so the model's effective state stays at ``DownloadOngoing`` until the periodic ``_emit_existing_download_progress`` reconciliation loop notices the on-disk snapshot and force-promotes it. That delay can silently break test fixtures that expect synchronous completion (the bench harness, the API's polling ``StartDownload`` -> ``DownloadCompleted`` waiter) and forces an unnecessary HF re-validation step for offline / air-gapped users who restart between transfer and reconciliation. Fix: after materializing each zero-byte marker on disk, replace its seeded ``not_started`` entry in ``file_progress`` with a fully-complete ``RepoFileDownloadProgress(status="complete")``. ``RepoFileDownloadProgress`` is frozen so we replace the dict slot rather than mutating in place. The pattern mirrors the regular file completion path in ``download_one``'s ``on_file_progress`` callback (``status="complete" if is_renamed else "in_progress"`` resolves to ``complete`` for the final emission), preserving the documented progress-rollup invariant. Regression: ``test_zero_byte_files_marked_complete_in_progress_map`` exercises the zero-byte materialization fixture from the round-(N+10) test and additionally captures the final progress callback emission. The canonical safetensor's progress callback is now invoked by the test fake (matching the production ``download_file_from_peer`` contract), so the canonical entry's status flips to ``complete`` correctly. Pre-fix the rolled-up status is ``not_started`` because the zero-byte entries never transition; post-fix every per-file entry is ``complete`` and the rollup status is ``complete`` -- the regression Codex called out is locked in. Verified the test fails on the pre-fix coordinator via ``git stash``.
12d9aa3 to
19d6ce1
Compare
Squashes 30 commits from feature/drafter-in-process-tuning onto post-#16 main. Cumulative behavior changes: - Add Drafter abstraction + n-gram drafting strategy as a non-model alternative for speculative decoding (model | pipelined | ngram | none) - Wire drafter KV prefix cache + manual drafter prefill on the spec path - Bypass exo prefill+prefix cache for drafters: mlx_lm prefills target and drafter natively via speculative_generate_step - Add drafter benchmark harness (bench/drafter_bench.py) for A/B comparing draft modes - Forward draft_mode override and record n-gram acceptance for adaptive K - Per-request drafter overrides + warmup with configured K - Add drafter tuning, builder, and adaptive-K-gate test coverage Conflict resolution in src/exo/worker/engines/mlx/cache.py (make_kv_cache): - Keep #16's PP-gated KV_CACHE_BITS quantization block (operator-facing EXO_KV_CACHE_BITS feature for pipeline-parallel deployments) and the #16 PP gate in the no-make_cache() branch (prevents the documented BatchGenerator + QuantizedKVCache.merge() crash in single-node mode) - Adopt #19's docstring focusing on mlx_lm.speculative_generate_step compatibility - Drop the unconditional c.step = 16384 mutation on plain KVCache instances returned by model.make_cache() so the spec-decoding path receives caches whose allocation policy matches each architecture's declared shape (consistent with the V2 walk-back rationale in commit 241fe2c "Use mlx_lm native caches on spec-decoding path") - Retain qc.step = 16384 on QuantizedKVCache instances we construct ourselves in the PP path (internal to that allocation, doesn't affect spec-decoding which never reaches that branch)
Squashes 30 commits from feature/drafter-in-process-tuning onto post-#16 main. Cumulative behavior changes: - Add Drafter abstraction + n-gram drafting strategy as a non-model alternative for speculative decoding (model | pipelined | ngram | none) - Wire drafter KV prefix cache + manual drafter prefill on the spec path - Bypass exo prefill+prefix cache for drafters: mlx_lm prefills target and drafter natively via speculative_generate_step - Add drafter benchmark harness (bench/drafter_bench.py) for A/B comparing draft modes - Forward draft_mode override and record n-gram acceptance for adaptive K - Per-request drafter overrides + warmup with configured K - Add drafter tuning, builder, and adaptive-K-gate test coverage Conflict resolution in src/exo/worker/engines/mlx/cache.py (make_kv_cache): - Keep #16's PP-gated KV_CACHE_BITS quantization block (operator-facing EXO_KV_CACHE_BITS feature for pipeline-parallel deployments) and the #16 PP gate in the no-make_cache() branch (prevents the documented BatchGenerator + QuantizedKVCache.merge() crash in single-node mode) - Adopt #19's docstring focusing on mlx_lm.speculative_generate_step compatibility - Drop the unconditional c.step = 16384 mutation on plain KVCache instances returned by model.make_cache() so the spec-decoding path receives caches whose allocation policy matches each architecture's declared shape (consistent with the V2 walk-back rationale in commit 241fe2c "Use mlx_lm native caches on spec-decoding path") - Retain qc.step = 16384 on QuantizedKVCache instances we construct ourselves in the PP path (internal to that allocation, doesn't affect spec-decoding which never reaches that branch) Co-authored-by: jw-wcv <101585096+jw-wcv@users.noreply.github.com>
Cumulative #20 changes rebuilt onto post-#19 main via 3-way merge of #20's tip (ec255d6) into team-wcv/main. Replaces the earlier pure-diff approach which silently reverted ~12 #16 surfaces (peer_download.py, peer_file_server.py, mdns_announcer.py, router.py, main.py, constants.py, coordinator.py, etc.) -- all of those are now preserved at their main state. 57 files changed, +14,662 / -531. No conflict markers. Codex P1 (remote_drafter.py:673): symmetric drift guard. The original guard only rejected ``num_forwards > drafts_buffer_size`` (target K > drafter K) and missed the reverse direction. With drafter K > target K, the drafter would still send ``drafts_buffer_size`` uint32s while the target read only its smaller ``K + 1``, leaving trailing bytes in the socket and corrupting the next command frame. Fix carries ``target_drafts_buffer_size`` on every command frame (slot 7, previously reserved 0), and the drafter validates ``target_drafts_buffer_size == drafts_buffer_size`` BEFORE the session-cache lookup so a desynced wire fails with a protocol-level error instead of an incidental "unknown session" caused by garbage in slot 6. The original ``num_forwards > drafts_buffer_size`` guard is retained as defense in depth against an out-of-bounds slice in ``_run_drafter_*``. Codex P2 (batch_generator.py SequentialGenerator.close): explicit ``del self.draft_model`` + ``self.draft_model = None`` before the existing ``del self.model, self.tokenizer, self.group`` so MLX reclaims drafter weight buffers on close. Tests added by this PR: test_command_frame_round_trip: parametrized with new buffer field. test_command_frame_rejects_target_buffer_out_of_uint32_range: new. test_drafter_serve_loop_rejects_reverse_k_drift: regression test for the previously-uncovered direction. test_placement_drafter_asymmetric.py: 19 placement tests covering asymmetric drafter eligibility, RDMA reachability, socket-only fallback, and degradation events. Validation: basedpyright: 0 errors, 0 warnings, 0 notes ruff check: clean ruff format: clean pytest: 905 passed, 16 pre-existing test_placement.py failures (same on team-wcv/main; not introduced by this PR)
Cumulative #21 changes rebuilt onto post-#20 main (the freshly-rebuilt clean #20 base) via 3-way merge of #21's tip (628e497) into a fresh branch off the new clean #20 (71ef3ad). Replaces the earlier pure-diff approach which sat on top of the regression-laden old #20 and inherited the same silent reverts of #16 surfaces (peer_download, peer_file_server, mdns_announcer, router, main, constants, coordinator). All those #16 surfaces are preserved at their main/post-#19 state. 23 files changed, +3,867 / -575. No conflict markers. Codex P1 fixes (preserved): - worker/main.py: _should_drop_generation_task_at_drafter predicate guards generation tasks from DrafterRunner. - generate.py: clamp_num_draft_tokens_to_transport accepts both DrafterTransport and RemoteTransport via HasNumDraftTokens Protocol. - placement.py: directional reachability check (drafter -> target rank 0) via _drafter_node_is_reachable. - utils_mlx.py: mx_all_gather_tasks splits leftover into front-of-queue + back-of-queue so admission can't starve a fresh peer's delayed first page. - pipelined_drafter.py: _get_tokenizer_vocab_size walks four fallbacks (added_vocab + len, len, max id, base). - instances.py: target_peer_socket_port is Optional with default=None for backward compat with replay events. Production hardening additions: - TargetPeerFanout dedicated TCP fanout (sidesteps JACCL int/float wire-conflation bug entirely). - Backend-aware mx_broadcast_int_list (ring uses all_sum, JACCL uses send/recv). - DrafterAbortedError + DRAFT_ABORT_SENTINEL + _broadcast_abort resilience path. - Per-request metrics dict (proposed/accepted/rounds) + EXO_SPEC_DIAG-gated _spec_diag hooks. - Pre-allocated listener port + reserved_ports set threading so drafter/target-peer/jaccl-coordinator all draw distinct ports. - Kernel-vetted random_ephemeral_port_excluding(reserved) helper layered on the existing kernel-vetted random_ephemeral_port. - Spec-decode telemetry on ChatCompletionResponse (proposed_draft_tokens, spec_decode_rounds, accepted_prediction_tokens, rejected_prediction_tokens). - bench_compare.py and bench_concurrent.py for spec-decode benchmarking sweeps. Tests added: test_placement_drafter_asymmetric.py: test_asymmetric_drafter_and_target_peer_ports_are_distinct test_drafter_and_target_peer_avoid_jaccl_coordinator_port (both updated with explicit node_rdma_ctl=NodeRdmaCtlStatus(enabled=True) setup since post-#19 main's _all_rdma_ctl_enabled filter requires it). test_pipelined_drafter.py: TestDrafterAbortRecovery, TestGetTokenizerVocabSize (4-fallback walk regression coverage). test_remote_drafter.py: resilience tests for sticky _is_failed flag + fail-fast open_session check + try/except OSError wrappers. test_utils_mlx_broadcast.py: task-ID encoding, _MX_AGREE_MAX_TASKS admission, leftover-front/back-of-queue split coverage. test_spec_diag_gating.py: EXO_SPEC_DIAG diagnostic toggle. bench_compare.py / bench_concurrent.py: '# type: ignore' headers. Validation: basedpyright: 0 errors, 0 warnings, 0 notes. ruff check: clean. ruff format: clean. pytest: 945 passed, 16 pre-existing test_placement.py failures (same on team-wcv/main; not introduced by this PR).
…ocket transports (#20) * Asymmetric pipelined drafter: V1/V2/V3 transports + reverse-K-drift fix Cumulative #20 changes rebuilt onto post-#19 main via 3-way merge of #20's tip (ec255d6) into team-wcv/main. Replaces the earlier pure-diff approach which silently reverted ~12 #16 surfaces (peer_download.py, peer_file_server.py, mdns_announcer.py, router.py, main.py, constants.py, coordinator.py, etc.) -- all of those are now preserved at their main state. 57 files changed, +14,662 / -531. No conflict markers. Codex P1 (remote_drafter.py:673): symmetric drift guard. The original guard only rejected ``num_forwards > drafts_buffer_size`` (target K > drafter K) and missed the reverse direction. With drafter K > target K, the drafter would still send ``drafts_buffer_size`` uint32s while the target read only its smaller ``K + 1``, leaving trailing bytes in the socket and corrupting the next command frame. Fix carries ``target_drafts_buffer_size`` on every command frame (slot 7, previously reserved 0), and the drafter validates ``target_drafts_buffer_size == drafts_buffer_size`` BEFORE the session-cache lookup so a desynced wire fails with a protocol-level error instead of an incidental "unknown session" caused by garbage in slot 6. The original ``num_forwards > drafts_buffer_size`` guard is retained as defense in depth against an out-of-bounds slice in ``_run_drafter_*``. Codex P2 (batch_generator.py SequentialGenerator.close): explicit ``del self.draft_model`` + ``self.draft_model = None`` before the existing ``del self.model, self.tokenizer, self.group`` so MLX reclaims drafter weight buffers on close. Tests added by this PR: test_command_frame_round_trip: parametrized with new buffer field. test_command_frame_rejects_target_buffer_out_of_uint32_range: new. test_drafter_serve_loop_rejects_reverse_k_drift: regression test for the previously-uncovered direction. test_placement_drafter_asymmetric.py: 19 placement tests covering asymmetric drafter eligibility, RDMA reachability, socket-only fallback, and degradation events. Validation: basedpyright: 0 errors, 0 warnings, 0 notes ruff check: clean ruff format: clean pytest: 905 passed, 16 pre-existing test_placement.py failures (same on team-wcv/main; not introduced by this PR) * Fix 16 pre-existing JACCL placement tests by populating node_rdma_ctl These tests existed on team-wcv/main as known failures: they call place_instance(...) without populating the node_rdma_ctl parameter, which causes the JACCL placement filter _all_rdma_ctl_enabled (placement.py:309) to reject all RDMA cycles because rdma_ctl_status.get(node_id) returns None for every node. Fix: add node_rdma_ctl={node: NodeRdmaCtlStatus(enabled=True), ...} to each test's place_instance call so the JACCL request fails (or succeeds) for the right reason -- the Thunderbolt IPv4 preflight behavior the test was designed to exercise. No test logic, assertions, or topology setup modified -- only the missing keyword argument added. Tests fixed (all in src/exo/master/tests/test_placement.py): test_jaccl_placement_skips_thunderbolt_preflight_for_single_node_fallback test_jaccl_placement_accepts_maybe_ethernet_thunderbolt_bridge test_jaccl_placement_requires_repaired_thunderbolt_ipv4_paths test_jaccl_placement_falls_back_to_eligible_cycle_when_another_cycle_has_invalid_path test_jaccl_placement_prefers_eligible_cycle_among_multiple_size_2_cycles test_jaccl_placement_singleton_fallback_picks_best_node_regardless_of_tb test_jaccl_placement_allows_nodes_with_unknown_network_info test_jaccl_placement_allows_nodes_with_unclassified_interface_typing test_jaccl_placement_still_rejects_nodes_with_known_non_tb_paths test_jaccl_placement_rejects_nodes_with_only_loopback_unknown_typing test_jaccl_placement_allows_nodes_with_partial_interface_typing test_jaccl_placement_allows_bridge0_thunderbolt_with_unknown_typing test_jaccl_placement_allows_non_zero_bridge_index_thunderbolt test_jaccl_placement_rejects_nodes_with_vm_stack_bridges_and_primary_en test_jaccl_placement_rejects_nodes_with_unknown_en0_and_typed_wifi test_jaccl_placement_rejects_nodes_with_only_vpn_tunnel_unknown_typing Validation: basedpyright: 0 errors, 0 warnings, 0 notes ruff check: clean ruff format: clean pytest: 961 passed, 1 skipped (was 945 passed + 16 failed) * Address Codex re-review findings on rebuilt PR #20 Restores one regression that was lost during the 3-way merge rebuild plus 9 distinct findings from Codex's re-review. LOST REGRESSION (must restore from OLD #20 = ec255d6): - remote_drafter.py: symmetric K-drift guard via target_drafts_buffer_size in command frame slot 7. The new clean #20 base only had the forward-direction guard (num_forwards > drafts_buffer_size) and missed the reverse direction (drafter K > target K). target_drafts_buffer_size is now validated BEFORE the session-cache lookup in drafter_serve_loop so a desynced wire fails with a protocol error instead of an incidental "unknown session". P1 fixes: - placement.py: warm-drafter selection. Two-pass loop -- first pass picks an eligible candidate that has drafter weights on disk via _node_has_any_drafter_on_disk, second pass falls back to first eligible. Fixes FileNotFoundError in DrafterRunner._handle_load when first memory-eligible node is cold but a later one is warm. - utils_mlx.py: drafter listener bind retry. EADDRINUSE on target rank 0 is now retried via _bind_drafter_listener_with_retry (budget 8) with a fresh random_ephemeral_port draw each time; prior single-shot bind would hang ConnectToGroup on stale listeners. - placement.py: drafter-cycle reservation graceful degrade. Probe cycles_excluding_drafters against filter_cycles_by_memory first; fall back to unfiltered candidates if no reserved cycle has enough memory, so we degrade drafter rather than failing the instance. - placement.py: single-rank target stays on MlxRing. Pin Sharding.Pipeline + InstanceMeta.MlxRing for len(cycle) == 1 BEFORE the JACCL upgrade gate, so single-node ring + asymmetric drafter doesn't trip the JACCL TB-IPv4 preflight. - remote_drafter.py: assert OP_FORWARD reply width invariant (len(outputs) == num_forwards AND len(outputs) <= drafts_buffer_size) before padding so a regressed _run_drafter_forwards_remote can't silently desync the wire. P2 fixes: - api.py: false positive (cap is already 1024). - worker/main.py: drop path now calls runner.mark_task_dropped_locally so the planner doesn't re-select the same task every 100 ms tick. - generate.py: gate asymmetric drafter session-open on asymmetric_drafter_active AND draft_mode != "none" so demoted- to-none requests don't open sockets. - utils_mlx.py: single-rank asymmetric path now sets drafter_id from drafter_placement.drafter_model_id so GenerationStats.drafter_model_id is populated for telemetry. Tests added: - test_drafter_serve_loop_rejects_reverse_k_drift: regression test for the lost symmetric K-drift guard (drafter K > target K direction). - test_command_frame_round_trip / test_command_frame_rejects_*: coverage for the new target_drafts_buffer_size slot. - Two existing asymmetric placement tests renamed/updated to expect MlxRingInstance for single-rank targets (the P1.4 fix changes placement behavior). Validation: basedpyright: 0 errors, 0 warnings, 0 notes. ruff check: clean. ruff format: clean. pytest: 923 passed, 1 skipped (was 921 + 2 new tests). * Address Codex round-2 feedback on PR #20 bind-retry helper Codex flagged two issues with the round-1 _bind_drafter_listener_with_retry: P1 (utils_mlx.py:452): re-rolling the port broke the placement contract. The drafter dials DrafterPlacement.drafter_socket_port, which the master announced; switching the listener to a fresh random port made the listener accept on a port the drafter never tries, leaving the connection stalled until accept_drafter's 180s timeout. The retry must keep the SAME port -- the realistic EADDRINUSE case in cross-host deploys is TIME_WAIT residue from a previous instance on the same port, which clears within ~100 ms. P2 (utils_mlx.py:440): the retry caught every OSError, so non- collision errors (EAFNOSUPPORT for an IPv6 wildcard on an IPv4-only host, EACCES for a privileged port) got retried as if transient and surfaced after the budget as a misleading "ephemeral port range exhausted" error. Only EADDRINUSE is transient; everything else must surface immediately so the operator sees the real root cause. Changes: - utils_mlx.py: rename _bind_drafter_listener_with_retry to _bind_drafter_listener_same_port_retry; remove random_ephemeral_port re-roll; filter on errno.EADDRINUSE only; add 100ms sleep between retries for TIME_WAIT residue absorption. Final error message references the placement-announced port so the operator knows to re-place rather than guess at the bind address. - test_utils_mlx_bind_retry.py: 6 new unit tests covering same-port retry success, persistent EADDRINUSE budget exhaust, non-collision OSError immediate surface, and ordering when EADDRINUSE precedes a non-collision error. Validation: basedpyright: 0 errors, 0 warnings, 0 notes. ruff check: clean. ruff format: clean. pytest: 929 passed, 1 skipped, 201 deselected, 0 failures (+6). --------- Co-authored-by: jw-wcv <101585096+jw-wcv@users.noreply.github.com>
Summary
Upstream PRs included
Validation
uv run basedpyrightuv run ruff checkEXO_DASHBOARD_DIR=Development/Tooling/exo/dashboard/build uv run pytest -q-> 511 passed, 1 skipped, 199 deselectedEXO_DASHBOARD_DIR=Development/Tooling/exo/dashboard/build uv run pytest src/exo/shared/tests/test_xdg_paths.py src/exo/download/tests/test_peer_download.py src/exo/utils/tests/test_keyed_backoff.py src/exo/worker/tests/unittests/test_worker_instance_backoff.py src/exo/shared/tests/test_apply/test_apply_rdma_gating.py src/exo/master/tests/test_placement.py -q-> 61 passedNotes
nix fmtwas not run because Nix is not installed on this Mac. Official multi-user install requires an interactive sudo password; official single-user install is no longer supported on Darwin. The only post-check change was Python-only and was formatted withuv run ruff format src/exo/main.py.